Nicole Hamler

"On my honor, as a student, I have neither given nor received unauthorized aid on this academic work."

Geospatial Analysis using GeoPandas

1. GeoPandas Basics:

  • GeoPandas is an extension to Pandas that allows for spatial operations and analytics using the shapely library.
  • GeoPandas should be used for exploratory data analysis when using Jupyter notebooks and when working with discrete spatial data types.
  • Six classes of geometric objects:
    • Single entity:
      • Point
      • Line
      • Polygon
    • Homogeneous entity collections:
      • Multi-Point
      • Multi-Line (MultiLineString)
      • Multi-Polygon
  • Corresponding with Pandas' Series, GeoPandas has a GeoSeries which is the building block for the GeoDataFrame
    • GeoSeries made up of an index and geometry type which is a shapely geometry object (e.g. object.area, object.bounds, etc.)

Two types of geo data

Raster data vs. vector data:

  • Raster data is based on satellite imagery
    • Pixel based
    • GDAL uses raster data
  • Vector data is based on coordinates and their relationships
    • Geometric object based (i.e. point, line, etc.)
    • GEOS uses vector data
  • GeoPandas libraries generally work with vector data and use the beforementioned geometric object classes also referred to as OGC Simple Features. Each vector feature will have attributes attached to it that describes that feature.
In [1]:
from IPython.display import Image
Image(filename="data\\raster_vector.jpg") 
Out[1]:

Libraries generally used with GeoPandas:

  • Matplotlib
  • Fiona
    • Interface to GDAL/OGR
    • Allows GeoPandas to read and write many different GIS file formats
  • Shapely
    • Interface to GEOS
  • Seaborn
  • Descartes
  • Pyproj - pythonic interface to PROJ.4
In [2]:
%matplotlib inline

from __future__ import (absolute_import, division, print_function)
import os

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

import seaborn as sns
plt.style.use('bmh')

import pandas as pd
import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame
from geopandas.tools import sjoin

2. Shapely and GeoPandas

The Shapely library allows manipulation and analysis of geometric objects:

In [3]:
# Import of geometric objects for their creation and visual presentation
from shapely.geometry import Point, LineString, Polygon
In [4]:
# Creating geometric objects with values
point = Point(1,1)
line = LineString([(0,0),(5,9),(10,8)])
poly = line.buffer(1)
In [5]:
# Does the polygon contain the point?
poly.contains(point)
Out[5]:
True

GeoSeries and Geometric Objects

A GeoSeries from a list of the shapely point objects can be created using the point constructor:

  • A GeoSeries is a series that holds the geometry objects
In [6]:
# Assigning points to a GeoSeries:
gs = GeoSeries([Point(-10, 15), Point(-5, 16), Point(10, 2)])
gs
Out[6]:
0    POINT (-10 15)
1     POINT (-5 16)
2      POINT (10 2)
dtype: object
In [7]:
# Points can now be plotted on a chart:
gs.plot(marker='*', color='red', markersize=100, figsize=(4, 4))
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0xc618be0>
In [8]:
# One can also assign a polygon to the GeoSeries:
gs1 = GeoSeries(poly)
gs1.plot(color = 'blue', markersize = 100, figsize = (4,4))
plt.xlim([-5, 15])
plt.ylim([-5, 15]);
In [9]:
# Assigning the line to the GeoSeries:
gs2 = GeoSeries(line)
gs2.plot(color = 'blue', markersize = 100, figsize = (4,4))
plt.xlim([0, 15])
plt.ylim([0, 15]);

3. Datasets included in GeoPandas

Naturalearth dataset can be referenced in GeoPandas

In [10]:
# Getting the GeoDataFrame and assigning to variable:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head(2)
Out[10]:
pop_est continent name iso_a3 gdp_md_est geometry
0 28400000.0 Asia Afghanistan AFG 22270.0 POLYGON ((61.21081709172574 35.65007233330923,...
1 12799293.0 Africa Angola AGO 110300.0 (POLYGON ((16.32652835456705 -5.87747039146621...
  • The GeoDataFrame is comparable to the Pandas DataFrame, however, it includes a 'geometry' column. This geometric attribute returns a GeoSeries (series that holds the geometric objects).
In [11]:
# Plot of NaturalEarth Dataset:
world.plot(figsize=(10,20))
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xcaff5c0>

GeoPandas allows us to show the GDP per capita of the countries by referencing the specific column:

In [12]:
# Plot by GDP per capita:
world = world[(world.pop_est>0)]

world['gdp_per_cap'] = world.gdp_md_est / world.pop_est
world.plot(column='gdp_per_cap', cmap = 'pink', figsize = (10,20));

4. Shapefiles

Shapefiles are datasets that store geometric and location attributes of geographic features (e.g. coordinates).

  • Map of Germany obtained through dataset downloaded from: http://www.diva-gis.org/gdata
  • GeoPandas has function gpd.read comparable to Pandas which allows program to read in shape files for further spatial analysis

a) Simple Shapefile containing one geometric object

In [13]:
germany = gpd.read_file("data\DEU_adm0.shp")
germany
Out[13]:
ID_0 ISO NAME_0 OBJECTID_1 ISO3 NAME_ENGLI NAME_ISO NAME_FAO NAME_LOCAL NAME_OBSOL ... CARICOM EU CAN ACP Landlocked AOSIS SIDS Islands LDC geometry
0 86 DEU Germany 62 DEU Germany GERMANY Germany Deutschland ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 (POLYGON ((8.70837306976324 47.71555709838862,...

1 rows × 71 columns

Data types of contained data in the geometry column can be obtained using ".geom_type"

In [14]:
germany.geom_type.head(2)
Out[14]:
0    MultiPolygon
dtype: object
  • This file contains one multipolygon as a geometric object.

Object can then be plotted by using the .plot extension:

In [15]:
germany.plot(cmap='plasma', figsize=(4,6));

b) Shapefile containing multiple geometric objects

Let's import a file that depicts the different districts in Germany:

  • The GeoDataFrame shows muliple rows and columns with the geometry column being the last.
In [16]:
districts = gpd.read_file("data\DEU_adm2.shp")
districts.head()
Out[16]:
ID_0 ISO NAME_0 ID_1 NAME_1 ID_2 NAME_2 TYPE_2 ENGTYPE_2 NL_NAME_2 VARNAME_2 geometry
0 86 DEU Germany 1 Baden-Württemberg 1 Freiburg Regierungsbezirk Administrative Region Friburgo|Fribourg (POLYGON ((8.70837306976324 47.71555709838862,...
1 86 DEU Germany 1 Baden-Württemberg 2 Karlsruhe Regierungsbezirk Administrative Region POLYGON ((9.457980155944824 49.64889144897461,...
2 86 DEU Germany 1 Baden-Württemberg 3 Stuttgart Regierungsbezirk Administrative Region Estugarda POLYGON ((9.650460243225155 49.7763404846192, ...
3 86 DEU Germany 1 Baden-Württemberg 4 Tübingen Regierungsbezirk Administrative Region Tubinga POLYGON ((9.951169013977165 48.63167953491211,...
4 86 DEU Germany 2 Bayern 5 Mittelfranken Regierungsbezirk Administrative Region Franconia Central|Middle Franconia|Média Franc... POLYGON ((10.74197006225597 49.77238845825207,...
In [17]:
# Accessing the geometry column only is similar to how it is done in Pandas:
districts.geometry.head()
Out[17]:
0    (POLYGON ((8.70837306976324 47.71555709838862,...
1    POLYGON ((9.457980155944824 49.64889144897461,...
2    POLYGON ((9.650460243225155 49.7763404846192, ...
3    POLYGON ((9.951169013977165 48.63167953491211,...
4    POLYGON ((10.74197006225597 49.77238845825207,...
Name: geometry, dtype: object

Type of the shapefile and data types contained in the file:

In [18]:
# Type of file: 
type(districts)
Out[18]:
geopandas.geodataframe.GeoDataFrame
  • Type of the districts file is a GeoDataFrame - essentially this is a Pandas dataframe with a geometry column.
In [19]:
# Data type of geometry column:
type(districts.geometry)
Out[19]:
geopandas.geoseries.GeoSeries
  • Data Type contained in the geometry column are GeoSeries.
In [20]:
# Accessing a specific value's data type within the GeoSeries:
type(districts.geometry[0])
Out[20]:
shapely.geometry.multipolygon.MultiPolygon
  • Data Type contained in the GeoSeries are Polygons

Geometric analysis

Calculating the area of districts in the file:

In [21]:
# To get the area of the different districts:
districts.geometry.area.head()
Out[21]:
0    1.119821
1    0.852975
2    1.298480
3    1.079738
4    0.893166
dtype: float64

Creating / Obtaining a separate point to analyze various aspects in relation to the districts file:

In [22]:
# Assigning coordinates of point of the geographic location for Rothenburg ob der Tauber (coordinates obtained from Google)
# Variable_Name = Point(location.longitude, location.latitude)
RBG = Point(10.1867, 49.33802)
districts.contains(RBG).head()
Out[22]:
0    False
1    False
2    False
3    False
4     True
dtype: bool
  • Returns a large number of rows and the details of the district are not depicted in this output.

Showing only the specific district (polygon) which contains the point in the output:

In [23]:
# Obtaining specific information on the geometry object by using .contains(Point)
districts[districts.contains(RBG)]
Out[23]:
ID_0 ISO NAME_0 ID_1 NAME_1 ID_2 NAME_2 TYPE_2 ENGTYPE_2 NL_NAME_2 VARNAME_2 geometry
4 86 DEU Germany 2 Bayern 5 Mittelfranken Regierungsbezirk Administrative Region Franconia Central|Middle Franconia|Média Franc... POLYGON ((10.74197006225597 49.77238845825207,...
  • District which contains Rothenburg ob der Tauber (RBG) is Middle Franconia ("Mittelfranken") in the state of Bavaria ("Bayern").

Distance from each district to our point:

In [24]:
# Obtain distance from each district to point by using .geometry.distance(var)
districts.geometry.distance(RBG).head()
Out[24]:
0    1.737935
1    0.577611
2    0.037461
3    0.739213
4    0.000000
dtype: float64

Plotting the shapefile

  • Colorscheme can be customized by changing cmap attribute
In [25]:
districts.plot(cmap='viridis', figsize=(4,6))
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x14179470>

5. Working with multiple data sets

Import of two additional files to show population centers and roads in Germany

In [26]:
# Naturalearth dataset provides us with global data which we will have to restrict to the coordinates for Germany 
pop = gpd.read_file("data\\ne_10m_populated_places.shp")
pop.head(2)
Out[26]:
SCALERANK NATSCALE LABELRANK FEATURECLA NAME NAMEPAR NAMEALT DIFFASCII NAMEASCII ADM0CAP ... wof_id CAPALT NAME_EN NAME_DE NAME_ES NAME_FR NAME_PT NAME_RU NAME_ZH geometry
0 10 1 8 Admin-1 capital Colonia del Sacramento 0 Colonia del Sacramento 0.0 ... 421199749 0 Colonia del Sacramento Colonia del Sacramento Colonia del Sacramento Colonia del Sacramento Colônia do Sacramento Колония-дель-Сакраменто 科洛尼亞德爾薩克拉門托 POINT (-57.84000247340134 -34.47999900541754)
1 10 1 8 Admin-1 capital Trinidad 0 Trinidad 0.0 ... 890444639 0 Trinidad Trinidad Trinidad Trinidad Trinidad Тринидад 特立尼達 POINT (-56.90099656015872 -33.5439989373607)

2 rows × 103 columns

Importing road map data from natural earth database

In [27]:
# As with the population dataset, the road map data provides us with global data
rd = gpd.read_file("data\\ne_10m_roads.shp")
rd.head(1)
Out[27]:
scalerank featurecla type sov_a3 note edited name namealt namealtt routeraw ... rwdb_rd_id orig_fid prefix uident continent expressway level min_zoom min_label geometry
0 8 Road Secondary Highway CAN Version 1.5: Changed alignment, a few adds in ... ... 0 0 314705 North America 0 7.1 9.6 LINESTRING (-133.3253277398332 62.215709333037...

1 rows × 32 columns

In [28]:
# Grouping by the different continents included in the file:
rd.groupby('continent').count()
Out[28]:
scalerank featurecla type sov_a3 note edited name namealt namealtt routeraw ... add rwdb_rd_id orig_fid prefix uident expressway level min_zoom min_label geometry
continent
Africa 6761 6761 6761 6761 6761 6761 6761 6761 6761 6761 ... 6761 6761 6761 6761 6761 6761 6761 6761 6761 6761
Asia 25505 25505 25505 25505 25505 25505 25505 25505 25505 25505 ... 25505 25505 25505 25505 25505 25505 25505 25505 25505 25505
Europe 9849 9849 9849 9849 9849 9849 9849 9849 9849 9849 ... 9849 9849 9849 9849 9849 9849 9849 9849 9849 9849
North America 9238 9238 9238 9238 9238 9238 9238 9238 9238 9238 ... 9238 9238 9238 9238 9238 9238 9238 9238 9238 9238
North America x-fade 188 188 188 188 188 188 188 188 188 188 ... 188 188 188 188 188 188 188 188 188 188
Oceania 856 856 856 856 856 856 856 856 856 856 ... 856 856 856 856 856 856 856 856 856 856
South America 4204 4204 4204 4204 4204 4204 4204 4204 4204 4204 ... 4204 4204 4204 4204 4204 4204 4204 4204 4204 4204

7 rows × 31 columns

In [29]:
# Extracting only European roadways as would be done in Pandas:
EU = rd.loc[rd['continent'] == "Europe"]
EU.head(2)
Out[29]:
scalerank featurecla type sov_a3 note edited name namealt namealtt routeraw ... rwdb_rd_id orig_fid prefix uident continent expressway level min_zoom min_label geometry
4186 4 Road Major Highway New in version 2.0.0 _untitled_262 ... 0 0 0 Europe 1 4.0 7.0 LINESTRING (23.84246506528342 42.9061918097508...
4187 3 Road Major Highway New in version 2.0.0 31 E31 A61 ... 0 0 E 0 Europe 1 E 3.0 6.0 LINESTRING (8.314632456881025 49.5508332150352...

2 rows × 32 columns

6.) Multiple layers

a) Joining two datasets:

In [30]:
ax = EU.plot(linewidth = 0.5, color = 'w', figsize = (4,6))
districts.plot(ax=ax, cmap = "viridis", markersize = 5)
ax.set(xlim=(5.883,15.88), ylim=(47.12, 55.84));
  • The code used above will adjust the plot to the figuresize, however, if ax.set_axis_off() is used, the size will remain true to the original file's sizing.

b) Joining three datasets together on one plot:

In [31]:
ax = EU.plot(linewidth = 0.5, color = 'grey', figsize = (8,10))
districts.plot(ax=ax, cmap = "viridis", markersize = 5)
pop.plot(ax=ax, linewidth = 0.25, color = 'r', markersize = 15, figsize = (8,10))
ax.set(xlim=(5.883,15.88), ylim=(47.12, 55.84));
  • The map above shows us all three plots layered in one image, but due to color scheme it is difficult to make out different districts, population centers, etc.

Use of different color schemes can be helpful in differentiating between the data points.

  • Since we have used global data, it is not very detailed on a national level (i.e. the road map data).
  • To use different color schemes for an attribute use of column = 'COLUMN_NAME' in the applicable dataset can be used.

Classification of population centers

In [32]:
# Instead of using param of 'color=' we will use 'column=COLUMN_NAME' to allow us to color code specific values of the datapoints
# legend = True will display the legend on the picture
ax =districts.plot( cmap='pink',  markersize = 5, edgecolor = 'grey', figsize = (10,10)) 
rd.plot(ax=ax,linewidth = 1, column = "type", cmap = 'Greys')
pop.plot(ax=ax,linewidth = 0.5, column = 'FEATURECLA', cmap= 'cool', markersize = 30, legend = True)
ax.set(xlim=(5.883,15.88), ylim=(47.12, 55.84));

7. Coordinate reference systems

  • GeoDataFrame or GeoSeries has the .crs attribute which holds the description of the coordinate reference system (CRS) of the geometric objects. The coordinate reference system can be converted using the _crs function.
In [33]:
EU.crs
Out[33]:
{'init': u'epsg:4326'}
  • The current CRS of our data is based on the World Geodetic System 1984 (WGS84). EPSG (European Petroleum Survey Group) number 4326 belongs to the collection of definitions of CRS' in WGS84. This Earth-fixed reference system is based on static constants and model parameters describing the Earth's shape, gravity, size, and geomagnetic fields. (https://www.esri.com/en-us/home)
In [34]:
EU.geometry.head(3)
Out[34]:
4186    LINESTRING (23.84246506528342 42.9061918097508...
4187    LINESTRING (8.314632456881025 49.5508332150352...
4188    LINESTRING (8.112537509765129 49.7711373578970...
Name: geometry, dtype: object
  • The output shows us that the values are describing longitude and latitude of the coordinates.

Although EPSG 4326 is the most commonly used CRS, due to its global lat/lon references, the dataset can easily be converted to and/or from the existing crs to another:

In [35]:
# Converting from EPSG:4326 (lon/lat) to EPGS:32631 (units in meters)
EU2 = EU.to_crs(epsg=32631)
In [36]:
EU2.geometry.head(3)
Out[36]:
4186    LINESTRING (2203492.659569045 4966305.94317193...
4187    LINESTRING (884333.6129756082 5502274.73182758...
4188    LINESTRING (868055.9726472022 5525737.45639955...
Name: geometry, dtype: object
  • Now the values in the geometric column have been converted to a different crs based on units in meters.

8. Working with city based dataset

In [37]:
# Importing the data
ATL = gpd.read_file("data\Housing_and_Transportation_Affordability_Type_8_Family.shp")
In [38]:
ATL.head(2)
Out[38]:
OBJECTID GEOID10 NAME10 PLNG_REGIO Median_hou Type_8__An GlobalID last_edite Pop_Chng_p Tp8_HousTr Tp8_Housin Tp8_Transp Tp8_Annual geometry
0 1 13045910103 9101.03 ARC 20 57188 4.60 {BB0FBD18-1DE0-460B-8015-13BB6C9923BA} 2015-06-02T14:23:55.000Z 100.3 47.42 25.10 22.32 32567 POLYGON ((-84.95479536437216 33.71964040675272...
1 2 13045910104 9101.04 ARC 20 54987 33.13 {298B4E45-3352-4590-B3D2-D7B1D1522D8F} 2015-06-02T14:23:55.000Z 201.1 46.98 24.73 22.25 32394 POLYGON ((-84.94540236226057 33.72199040708714...
In [39]:
# CRS type
ATL.crs
Out[39]:
{'init': u'epsg:4326'}

a) Selecting different columns and/or colorschemes to depict map differently based on contained data

In [40]:
# Showing map with colorscheme based on column depicting: Median household income
ATL.plot(cmap = 'coolwarm', column = "Median_hou", figsize = (10,10), legend = True)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x2f141198>
In [41]:
# Different colorschemes may emphasize data differences better: cmap = COLORSCHEME
ATL.plot(column = "Median_hou", cmap = 'cubehelix', figsize = (10,10), legend = True)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aea81d0>
In [42]:
# Annual Transit Trips column
ATL.plot(cmap = 'coolwarm', column = "Type_8__An", figsize = (10,10), legend = True)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x2a51c550>
  • Highest public transit use is within the immediate Atlanta metro.
In [43]:
# Annual Vehicle miles traveled column
ATL.plot(cmap = 'coolwarm', column = "Tp8_Annual", figsize = (10,10), legend = True)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x2af57630>
  • Individuals outside of the immediate metro area tend to travel via vehicle rather than public transit.
In [44]:
# Percent of housing costs of household income column 
ATL.plot(cmap = 'coolwarm', column = "Tp8_Housin", figsize = (10,10), legend = True)
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x24d08710>
In [45]:
# Showing only geometry column data contained in the dataset
ATL.geometry.head()
Out[45]:
0    POLYGON ((-84.95479536437216 33.71964040675272...
1    POLYGON ((-84.94540236226057 33.72199040708714...
2    POLYGON ((-85.07941439313719 33.59412637518832...
3    POLYGON ((-84.90140834210276 33.57345437731921...
4    POLYGON ((-84.93233335932916 33.75177941452344...
Name: geometry, dtype: object

b) Adding Layers

  • Using the same concept as above indicating the median annual household income for the various districts and placing the road map on top.
  • Notice: in the road map plot column="COLUMN_NAME" will differentiate between different values of the column data which can then be further depicted in the legend
In [46]:
ax = rd.plot(linewidth = 1.5, column ="type", cmap = 'prism', figsize = (10,10), legend = True)
ATL.plot(ax=ax, column = "Median_hou", cmap = 'coolwarm', markersize = 5, legend = True)
ax.set(xlim=(-85.35,-83.5,), ylim=(33, 34.6))
Out[46]:
[(33, 34.6), (-85.35, -83.5)]
  • Road map overlay shows the different type of roads going through the Atlanta metro.

9. Converting a CSV/Excel file to GeoSeries

In [47]:
# Importing Excel file using Pandas
apd = pd.read_excel("data\\COBRADATA2016.xlsx")
apd.head(2)
Out[47]:
MI_PRINX offense_id rpt_date occur_date occur_time poss_date poss_time beat apt_office_prefix apt_office_num ... dispo_code MaxOfnum_victims Shift Avg Day loc_type crimes neighborhood npu x y
0 5637549 150102493 01/10/2016 01/10/2016 22:00:00 01/10/2016 22:00:00 511 NaN NaN ... NaN 1.0 Eve Sun 21.0 BURGLARY-NONRES Downtown M -84.39487 33.75757
1 5641270 150611492 03/01/2016 02/25/2016 12:00:00 02/29/2016 19:00:00 412 NaN NaN ... NaN 1.0 Unk Unk 20.0 LARCENY-NON VEHICLE Vine City L -84.41411 33.75823

2 rows × 23 columns

Convert the Pandas DataFrame (containing x and y columns instead of a single geometry column) to a GeoDataFrame with a geometry column that combines x and y columns:

In [48]:
# Appending the two columns into one geometry column
geometry = [Point(xy) for xy in zip(apd.x, apd.y)]
# Dropping the x and y columns from the dataframe
apd = apd.drop(['x', 'y','apt_office_prefix', 'apt_office_num', 'dispo_code'], axis=1)
# Initializing the lat / long coordinates as WGS84
crs = {'init': 'epsg:4326'}
# Saving the GeoDataFrame in new variable 
gdf = gpd.GeoDataFrame(apd, crs=crs, geometry=geometry)

Adjust coordinates to show area of map that coincides with the imported dataset:

In [49]:
# Since the crime data does not encompass the Atlanta metro and surrounding areas, will 'zoom in' for a closer look:
ax = rd.plot(linewidth = 1.5, column ="type", cmap = 'prism', figsize = (10,10))
ATL.plot(ax=ax, column = "Median_hou", cmap = 'coolwarm', markersize = 5)
# Add crime dataset plot the base maps
# Notice: column = "COLUMN_NAME" will color code the different values in the specified column
gdf.plot(ax=ax, marker = "*", column = "crimes", markersize = 5, cmap='hot', legend = True)
# Adjust ax.set(xlim=(lat,lat), ylim=(lon,lon)) to the desired coordinates
ax.set(xlim=(-84.56,-84.25,), ylim=(33.6, 33.88))
Out[49]:
[(33.6, 33.88), (-84.56, -84.25)]

Comparison of affluent neighborhoods vs. lower income neighborhoods and the types of crimes committed:

  • By adjusting the coordinates we can specify the scope of the map
In [50]:
ax = rd.plot(linewidth = 1.5, column ="type", cmap = 'prism', figsize = (10,10))
ATL.plot(ax=ax, column = "Median_hou", cmap = 'coolwarm', markersize = 5, legend = True)
gdf.plot(ax=ax, marker = "o", column = "crimes", markersize = 30, cmap='plasma', legend = True)
# Adjust coordinates to zoom in on a specific area
ax.set(xlim=(-84.465,-84.40,), ylim=(33.825, 33.89))
Out[50]:
[(33.825, 33.89), (-84.465, -84.4)]
  • The different types of crimes are now clearly distinguishable.
In [51]:
ax = rd.plot(linewidth = 1.5, column ="type", cmap = 'prism', figsize = (10,10))
ATL.plot(ax=ax, column = "Median_hou", cmap = 'coolwarm', markersize = 5, legend = True)
gdf.plot(ax=ax, marker = "o", column = "crimes", markersize = 30, cmap='seismic', legend = True)
ax.set(xlim=(-84.5,-84.425,), ylim=(33.74, 33.81))
Out[51]:
[(33.74, 33.81), (-84.5, -84.425)]
  • Analyzing data for different area by adjusting the coordinates.

10. Extracting specific information from CSV/Excel file to create layers.

  • Creating three different layers for each of the violent crimes and then staggering them on top of the base map layer.
In [52]:
# Creating layer for homicides committed, this extraction of specific data from a column using GeoPandas is similar to Pandas
gdf_h = gdf.loc[gdf['crimes'] == "HOMICIDE" ]
gdf_h.head(2)
Out[52]:
MI_PRINX offense_id rpt_date occur_date occur_time poss_date poss_time beat location MinOfucr MinOfibr_code MaxOfnum_victims Shift Avg Day loc_type crimes neighborhood npu geometry
29005 5696811 163660563113 12/31/2016 12/31/2016 NaN 12/31/2016 05:40:00 109 1090 HOLLYWOOD RD NW 110 0912 NaN Morn Sat NaN HOMICIDE Almond Park G POINT (-84.46084999999998 33.78459)
29006 5696812 163610248112 12/26/2016 12/26/2016 03:30:00 12/26/2016 03:30:00 410 1935 ALISON CT SW 110 0911 NaN Morn Mon NaN HOMICIDE Fort Valley R POINT (-84.45149000000001 33.70055)
In [53]:
gdf_r = gdf.loc[gdf['crimes'] == "RAPE" ]
gdf_r.head(2)
Out[53]:
MI_PRINX offense_id rpt_date occur_date occur_time poss_date poss_time beat location MinOfucr MinOfibr_code MaxOfnum_victims Shift Avg Day loc_type crimes neighborhood npu geometry
15 5666805 160010583 01/01/2016 01/01/2016 02:30:00 01/01/2016 02:45:00 208 3380 PEACHTREE RD NE 210 1103 1.0 Morn Fri NaN RAPE North Buckhead B POINT (-84.36597 33.84892)
472 5667263 160071420 01/07/2016 01/07/2016 09:15:00 01/07/2016 10:45:00 104 1565 MARTIN L. KING JR. DRIVE NW @ MOZLEY PARK 210 1103 1.0 Day Thu 31.0 RAPE Mozley Park K POINT (-84.44235000000001 33.7536)
In [54]:
gdf_a = gdf.loc[gdf['crimes'] == "AGG ASSAULT" ]
gdf_a.head(2)
Out[54]:
MI_PRINX offense_id rpt_date occur_date occur_time poss_date poss_time beat location MinOfucr MinOfibr_code MaxOfnum_victims Shift Avg Day loc_type crimes neighborhood npu geometry
6 5666794 153652888 01/01/2016 12/31/2015 23:05:00 01/01/2016 00:50:00 313 800 HUTCHENS RD SE @APS SOUTH ATLANTA HS 410 1314 1.0 Morn Thu 13.0 AGG ASSAULT South River Gardens Z POINT (-84.36186999999998 33.67102)
7 5666797 160010061 01/01/2016 01/01/2016 00:03:00 01/01/2016 00:09:00 403 1316 AVON AVE SW 410 1314 2.0 Morn Fri 20.0 AGG ASSAULT Venetian Hills S POINT (-84.43107000000001 33.72209)

Adding all three layers to the previously depicted neighborhoods of the metro area.

In [55]:
# Adding layers to higher income neighborhoods
ax = rd.plot(linewidth = 1.5, column ="type", cmap = 'prism', figsize = (5,5))
ATL.plot(ax=ax, column = "Median_hou", cmap = 'coolwarm', markersize = 5)
gdf_h.plot(ax=ax, marker = "o", column = "crimes", markersize = 30, color='r')
gdf_r.plot(ax=ax, marker = "o", column = "crimes", markersize = 30, color='k')
gdf_a.plot(ax=ax, marker = "o", column = "crimes", markersize = 30, color='b')
ax.set(xlim=(-84.465,-84.40,), ylim=(33.825, 33.89))
C:\Users\nhamler\Anaconda2\lib\site-packages\geopandas\plotting.py:389: UserWarning: Only specify one of 'column' or 'color'. Using 'color'.
  "'color'.", UserWarning)
Out[55]:
[(33.825, 33.89), (-84.465, -84.4)]
  • In the more affluent neighborhood it is now evident that while there were more burglaries and property crimes, there were no violent crimes during 2016.
In [56]:
# Adding the same layers to the moderate income neighborhood
ax = rd.plot(linewidth = 1.5, column ="type", cmap = 'prism', figsize = (5,5))
ATL.plot(ax=ax, column = "Median_hou", cmap = 'coolwarm', markersize = 5)
gdf_h.plot(ax=ax, marker = "o", column = "crimes", markersize = 30, color='r')
gdf_r.plot(ax=ax, marker = "o", column = "crimes", markersize = 30, color='k')
gdf_a.plot(ax=ax, marker = "o", column = "crimes", markersize = 30, color='b')
ax.set(xlim=(-84.5,-84.425,), ylim=(33.74, 33.81))
Out[56]:
[(33.74, 33.81), (-84.5, -84.425)]
  • Looking at the moderate / lower income neighborhood, the difference between the two neighborhoods is clearly visible.

A final look at the entire Atlanta metro area:

In [57]:
ax = rd.plot(linewidth = 1.5, column ="type", cmap = 'prism', figsize = (5,5))
ATL.plot(ax=ax, column = "Median_hou", cmap = 'coolwarm', markersize = 5)
gdf_h.plot(ax=ax, marker = "o", column = "crimes", markersize = 5, color='r') # Homicides - red
gdf_r.plot(ax=ax, marker = "o", column = "crimes", markersize = 5, color='k') # Rapes - black
gdf_a.plot(ax=ax, marker = "o", column = "crimes", markersize = 5, color='b') # Aggravated Assaults - blue 
ax.set(xlim=(-84.56,-84.25,), ylim=(33.6, 33.88))
Out[57]:
[(33.6, 33.88), (-84.56, -84.25)]
  • High number of aggravated assaults towards the city center are depicted in the plotted map.

12. Creating central points of the contained polygons in the GeoSeries

In [58]:
# Create new variable holding the central points of the geometry column
cents = ATL.centroid
In [59]:
cents.head()
Out[59]:
0    POINT (-84.95858032846071 33.68757095682742)
1    POINT (-84.94885753380844 33.76115226159461)
2     POINT (-85.0901192973005 33.58542884659065)
3    POINT (-84.91916589351241 33.51281168090114)
4    POINT (-84.92062018722537 33.72965547382351)
dtype: object

13. Interactive maps - Bokeh

In [60]:
# Importing Bokeh libraries 
from bokeh.plotting import figure, save
from bokeh.io import output_notebook, show
output_notebook()
from bokeh.plotting import figure, output_file, show
from bokeh.models import GeoJSONDataSource
Loading BokehJS ...
In [61]:
# Initialize plot (p) and assign title
p = figure(title="Hello World")
p
Out[61]:
Figure(
id = 'a4fa9a9e-8409-4541-9f7d-d337bf9fb23e', …)

a) Using ColumnDataSource as geo source

In [62]:
# Importing ColumnDataSource to allow Bokeh to read and store the data
from bokeh.models import ColumnDataSource
In [63]:
# Unlike Shapely, Bokeh can't read files which have geometry objects 
inter = gpd.read_file("data\\ne_10m_populated_places.shp")
inter.head()
Out[63]:
SCALERANK NATSCALE LABELRANK FEATURECLA NAME NAMEPAR NAMEALT DIFFASCII NAMEASCII ADM0CAP ... wof_id CAPALT NAME_EN NAME_DE NAME_ES NAME_FR NAME_PT NAME_RU NAME_ZH geometry
0 10 1 8 Admin-1 capital Colonia del Sacramento 0 Colonia del Sacramento 0.0 ... 421199749 0 Colonia del Sacramento Colonia del Sacramento Colonia del Sacramento Colonia del Sacramento Colônia do Sacramento Колония-дель-Сакраменто 科洛尼亞德爾薩克拉門托 POINT (-57.84000247340134 -34.47999900541754)
1 10 1 8 Admin-1 capital Trinidad 0 Trinidad 0.0 ... 890444639 0 Trinidad Trinidad Trinidad Trinidad Trinidad Тринидад 特立尼達 POINT (-56.90099656015872 -33.5439989373607)
2 10 1 8 Admin-1 capital Fray Bentos 0 Fray Bentos 0.0 ... 890451703 0 Fray Bentos Fray Bentos Fray Bentos Fray Bentos Fray Bentos Фрай-Бентос 弗賴本托斯 POINT (-58.3039974719095 -33.1389990288435)
3 10 1 8 Admin-1 capital Canelones 0 Canelones 0.0 ... 890444649 0 Canelones Canelones Canelones Canelones Canelones Канелонес 卡內洛內斯 POINT (-56.28400149324307 -34.53800400667547)
4 10 1 8 Admin-1 capital Florida 0 Florida 0.0 ... 890431207 0 Florida Florida Florida Florida Florida Флорида 佛羅里達 POINT (-56.21499844799416 -34.09900200521719)

5 rows × 103 columns

In [64]:
# Getting separate x and y values from the geometry objects contained in the file via the function getPointCoords
def getPointCoords(row, geom, coord_type):
    if coord_type =='x':
        return row[geom].x
    elif coord_type =='y':
        return row[geom].y
In [65]:
# Calculating coordinates for each column:
inter['x'] = inter.apply(getPointCoords,geom='geometry', coord_type='x', axis=1)
inter['y'] = inter.apply(getPointCoords,geom='geometry', coord_type='y', axis=1)
In [66]:
# Drop the geometry column to allow for Bokeh to read the file correctly
inter2 = inter.drop('geometry', axis=1).copy()
inter2.head(2)
Out[66]:
SCALERANK NATSCALE LABELRANK FEATURECLA NAME NAMEPAR NAMEALT DIFFASCII NAMEASCII ADM0CAP ... CAPALT NAME_EN NAME_DE NAME_ES NAME_FR NAME_PT NAME_RU NAME_ZH x y
0 10 1 8 Admin-1 capital Colonia del Sacramento 0 Colonia del Sacramento 0.0 ... 0 Colonia del Sacramento Colonia del Sacramento Colonia del Sacramento Colonia del Sacramento Colônia do Sacramento Колония-дель-Сакраменто 科洛尼亞德爾薩克拉門托 -57.840002 -34.479999
1 10 1 8 Admin-1 capital Trinidad 0 Trinidad 0.0 ... 0 Trinidad Trinidad Trinidad Trinidad Trinidad Тринидад 特立尼達 -56.900997 -33.543999

2 rows × 104 columns

In [67]:
# Creating the ColumnDataSource object
psource = ColumnDataSource(inter2)
psource
Out[67]:
ColumnDataSource(
id = '581b8399-08e8-4263-a3b5-ae371dba9cd9', …)
In [68]:
# Initializing plot figure
p = figure(title="A map of the World!")
In [69]:
# Adding points to the plot via our ColumnDataSource object
p.circle('x', 'y', source=psource, color='blue', size=1)
Out[69]:
GlyphRenderer(
id = 'e57e960f-e954-4743-b21d-6b02f2afce3d', …)
In [70]:
# File path 
outfp = r"data/point_map2.html"

# Save the map
save(p, outfp)
C:\Users\nhamler\Anaconda2\lib\site-packages\bokeh\io.py:527: UserWarning: save() called but no resources were supplied and output_file(...) was never called, defaulting to resources.CDN
  warnings.warn("save() called but no resources were supplied and output_file(...) was never called, defaulting to resources.CDN")
C:\Users\nhamler\Anaconda2\lib\site-packages\bokeh\io.py:537: UserWarning: save() called but no title was supplied and output_file(...) was never called, using default title 'Bokeh Plot'
  warnings.warn("save() called but no title was supplied and output_file(...) was never called, using default title 'Bokeh Plot'")
Out[70]:
'C:\\Users\\nhamler\\Desktop\\665\\Tutorial_GeoPandas\\data\\point_map2.html'
In [71]:
show (p)

b) GeoJSONDataSource

  • Seamless transition to newer GeoJsONDataSource functionality instead of ColumnDataSource
In [72]:
# Importing Sample Dataset included in Bokeh
from bokeh.sampledata.sample_geojson import geojson
In [73]:
# Mapping of sample data using GeoJSONDataSource
geo_source = GeoJSONDataSource(geojson=geojson)

# Initializing figure
p = figure()
p.circle(x='x', y='y', alpha=0.9, source=geo_source) 
Out[73]:
GlyphRenderer(
id = 'ba3fdd21-6841-4382-88ca-fab840c77153', …)
In [74]:
# Assigning file path
output_file("geojson.html")
show(p)

c) Bokeh and Google Maps

  • Bokeh's GMapPlot supports Googlemaps and enables Bokeh plots to be layered over Google Maps

Creating custom points to add to Google Map

In [75]:
# Import Bokeh functions
from bokeh.io import output_file, show
from bokeh.models import (GMapPlot, GMapOptions, ColumnDataSource, Circle, DataRange1d, PanTool, WheelZoomTool, BoxSelectTool)
In [76]:
# Display options of the Google Map
map_options = GMapOptions(lat=33.74, lng=-84.3880, map_type="roadmap", zoom=11)
In [77]:
# Assigning plot
plot = GMapPlot(
    x_range=DataRange1d(), y_range=DataRange1d(), map_options=map_options)
In [78]:
# Assigning title
plot.title.text = "Atlanta"
In [79]:
# Enter API key obtained from Google
plot.api_key = "AIzaSyASeDwXJrTLdxsgtTfdpUAfMxdtl8mZHZw"
In [80]:
# Creating points to be added to the map
source = ColumnDataSource(
    data=dict(
        lat=[33.74316,33.74319, 33.74314],
        lon=[-84.36, -84.32, -84.39],))
In [81]:
# Initializing Point display options
circle = Circle(x="lon", y="lat", size=15, fill_color="blue", fill_alpha=0.8, line_color=None)
In [82]:
# Adding points and display options to the plot
plot.add_glyph(source, circle)
Out[82]:
GlyphRenderer(
id = '17b026e6-aac3-4d3d-8365-31071c2351e5', …)
In [83]:
# Add customization tool options for plot display
plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool())
# Create file path
output_file("gmap_plot.html")
show(plot)

Adding crime dataset to Google Map:

In [84]:
crime = pd.read_excel("data\COBRADATA2016.xlsx")
crime.head(2)
Out[84]:
MI_PRINX offense_id rpt_date occur_date occur_time poss_date poss_time beat apt_office_prefix apt_office_num ... dispo_code MaxOfnum_victims Shift Avg Day loc_type crimes neighborhood npu x y
0 5637549 150102493 01/10/2016 01/10/2016 22:00:00 01/10/2016 22:00:00 511 NaN NaN ... NaN 1.0 Eve Sun 21.0 BURGLARY-NONRES Downtown M -84.39487 33.75757
1 5641270 150611492 03/01/2016 02/25/2016 12:00:00 02/29/2016 19:00:00 412 NaN NaN ... NaN 1.0 Unk Unk 20.0 LARCENY-NON VEHICLE Vine City L -84.41411 33.75823

2 rows × 23 columns

In [85]:
# Creating dictionary to hold the values for coordinates and point details
crime_source = ColumnDataSource(data=dict(x=crime['x'],
                                      y=crime['y'],
                                      neighborhood=crime['neighborhood'].values,
                                      crimes=crime['crimes'].values,
                                      victims=crime['MaxOfnum_victims'].values))
In [86]:
# Initializing the map options for the google map
map_options = GMapOptions(lat=33.74, lng=-84.3880, map_type="roadmap", zoom=11)
In [87]:
# Creating the GMapPlot based on the Google Map
plot = GMapPlot(
    x_range=DataRange1d(), y_range=DataRange1d(), map_options=map_options)
In [88]:
# Assinging title
plot.title.text = "Atlanta"
In [89]:
# Entering API key to enable Google Map functionality
plot.api_key = "AIzaSyASeDwXJrTLdxsgtTfdpUAfMxdtl8mZHZw"
In [90]:
# Display options for points
circle = Circle(x="x", y="y", size=2.5, fill_color="black", fill_alpha=0.8, line_color=None)
In [91]:
# Adding crime data and display options to the plot
plot.add_glyph(crime_source, circle)
Out[91]:
GlyphRenderer(
id = '90a03a8f-7766-4877-a075-af7d869d83c0', …)
In [92]:
plot.add_tools(PanTool(), WheelZoomTool(), BoxSelectTool())
output_file("gmap_plot.html")
In [93]:
show(plot)

d) Hover Tool

Adding the hover tool to display neighborhood, type of crime, and number of victims for each of the points.

In [94]:
# Importing the HoverTool
from bokeh.models import HoverTool
In [95]:
# Initializing the HoverTool and the information to be shown for each point
my_hover = HoverTool()
In [96]:
my_hover.tooltips = [("Crime:","@crimes"), ("Neighborhood:","@neighborhood"), ("No. of victims:", "@victims")]
In [97]:
# Adding the hover tool to the existing map
plot.add_tools(my_hover)
In [98]:
show(plot)